Abstract: Ensuring the usefulness of electronic data sources
while providing necessary privacy guarantees is an important
unsolved problem. This problem drives the need for
an overarching analytical framework that can quantify the
safety of personally identifiable information (privacy) while still
providing a quantifiable benefit (utility) to multiple legitimate
information consumers. State-of-the-art approaches have
predominantly focused on privacy. This paper presents the
first information-theoretic approach that promises an analytical
model guaranteeing tight bounds on how much utility is possible
for a given level of privacy and vice versa.

I. The Database Privacy Problem 

Information technology and electronic communications 
have been rapidly applied to almost every sphere of human 
activity, including commerce, medicine and social network- 
ing. The concomitant emergence of myriad large centralized 
searchable data repositories has made "leakage" of private 
information such as medical data, credit card information, 
or social security numbers via data correlation (inadvertently 
or by malicious design) highly probable and thus an important
and urgent societal problem. Unlike the well-studied
secrecy problem (e.g., [1]-[3]) in which the protocols or
primitives make a sharp distinction between secret and non- 
secret data, in the privacy problem, disclosing data provides 
informational utility while enabling possible loss of privacy 
at the same time. In fact, in the course of a legitimate 
transaction, a user learns some public information, which 
is allowed and needs to be supported for the transaction to 
be meaningful, but at the same time he can also learn/infer 
private information, which needs to be prevented. Thus 
every user is (potentially) also an adversary. This drives 
the need for a unified analytical framework that can tell 
us unequivocally and precisely how safe private data can 
be (privacy) and simultaneously provide measurable benefit 
(utility) to multiple legitimate information consumers. 

It has been noted that utility and privacy are competing 
goals: perfect privacy can be achieved by publishing nothing 
at all, but this has no utility; perfect utility can be obtained 
by publishing the data exactly as received, but this offers 
no privacy [4]. Utility of a data source is potentially (but 
not necessarily) degraded when it is restricted or modified 
to uphold privacy requirements. The central problem of this 
paper is a precise quantification of the tradeoff between the 
privacy needs of the respondents (individuals represented by
the data) and the utility of the sanitized (published) data for 
any data source. 

Though the problem of privacy and information leakage 
has been studied for several decades by multiple research 
communities (e.g., [4]-[8] and the references therein), the 
proposed solutions have been both heuristic and application-specific.
The recent groundbreaking theory of $\epsilon$-differential
privacy [9] from the theoretical computer science community 
provides the first universal metric of privacy that applies 
to any numerical database. We seek to address the open 
question of a universal and analytical characterization that 
provides a tight privacy-utility tradeoff using tools and tech- 
niques from information theory. 

Rate distortion theory is a natural choice to study the 
utility-privacy tradeoff; utility can be quantified via fidelity 
which, in turn, is related to distortion, and privacy can be 
quantified via equivocation. Our key insight is captured in 
the following theorem which is presented in this paper: for 
a data source with private and public data, minimizing the 
information disclosure rate sufficiently to satisfy the desired 
utility for the public data is equivalent to maximizing the pri- 
vacy for the private data. In a sparsely referenced paper [10] 
from three decades ago, Yamamoto developed the tradeoff 
between rate, distortion, and equivocation for a specific and 
simple source model. In this paper, we show via the above 
summarized theorem that Yamamoto's formalism can be 
translated into the language of data disclosure. Furthermore, 
we develop a framework that allows us to model data sources, 
specifically databases, develop application independent util- 
ity and privacy metrics, quantify the fundamental bounds on 
the utility-privacy tradeoffs, and develop a side-information 
model for dealing with questions of external knowledge. 

The paper is organized as follows. We present the channel
model and preliminaries in Section II. The main result and
the proof are developed in Section III. We discuss the results
and present numerical examples in Section IV. We conclude
in Section V.

II. The Database Privacy Problem 

A. Problem Definition 

While the problem of quantifying the utility-privacy tradeoff
applies to all types of data sources, we start our
study with databases because they are highly structured and 
historically better studied than other types of sources. A 
database is a table (matrix) whose rows represent individual 
entries and whose columns represent the attributes of each 
entry [5]. For example, the attributes of each entry in a 
healthcare database typically include name, address, social
security number (SSN), gender, and a collection of medical
information, and each entry contains the information per- 
taining to an individual. Messages from a user to a database 
are called queries and, in general, result in some numeric 
or non-numeric information from the database termed the 
response. 

The goal of privacy protection is to ensure that, to the 
extent possible, the user's knowledge is not increased beyond 
strict predefined limits by interacting with the database. 
The goal of utility provision is, generally, to maximize the 
amount of information that the user can receive. Depending 
on the relationships between attributes, and the distribution 
of the actual data, a response may contain information that 
can be inferred beyond what is explicitly included in the 
response. The privacy policy defines the information that 
should not be revealed explicitly or by inference to the user 
and depends on the context and the application. For example, 
in a database on health statistics, attributes such as name 
and SSN may be considered private data, whereas in a state 
motor vehicles database only the SSN is considered private. 
The challenge for privacy protection is to design databases 
such that responses do not reveal information contravening 
the privacy policy. 

B. Current Approaches and Metrics 

The problem of privacy in databases has a long and rich 
history stretching back to the 1970s and space restrictions 
preclude any attempt to do full justice to the different 
approaches that have been considered along the way. While 
there have been many heuristic approaches to privacy, we 
only present the major milestones in privacy research on 
creating quantitative privacy metrics. Since privacy is a 
requirement that appears in many diverse contexts, a robust 
and formal notion of privacy that satisfies most, if not all,
requirements is a tricky proposition and there have been 
many attempts at a definition. The reader is referred to the 
excellent survey by Dwork [11] for a detailed history of the 
field. The problem of privacy was first exposed by census 
statisticians who were required to publish statistics related to 
census functions but without revealing any particulars of in- 
dividuals in the census databases. An early work by Dalenius 
[6] reveals the depth to which this problem was considered. 
Several early attempts were made to publish census data 
using ad hoc techniques such as sub-sampling. However, 
the first widely reported attempt at a formal definition of 
privacy was by Sweeney [7]. The concept of k-anonymity 
proposed by Sweeney captures the intuitive notion of privacy 
that every individual entry should be indistinguishable from 
$(k - 1)$ other entries for some large value of $k$. This notion of
anonymity for database respondents is analogous to similar 
proposals that were made for anonymity on the Internet 
such as crowds [12]. More recently, researchers in the data 
mining community have proposed to quantify the privacy 
loss resulting from data disclosure as the mutual information 
between attribute values in the original and perturbed data 
sets, both modeled as random variables [8]. 
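The k-anonymity requirement described above can be checked mechanically. The following is a minimal sketch, assuming a table stored as a list of dictionaries; the function name `is_k_anonymous`, the choice of quasi-identifier columns, and the toy rows are hypothetical illustrations and do not come from [7].

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that appears in the table is shared by at least k rows."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical, already-generalized toy table (illustrative values only).
rows = [
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "diabetes"},
]

two_anonymous = is_k_anonymous(rows, ["zip", "age"], k=2)
three_anonymous = is_k_anonymous(rows, ["zip", "age"], k=3)
```

In this toy table each quasi-identifier group contains exactly two rows, so the check passes for k = 2 and fails for k = 3. The sketch says nothing about how useful the generalized table remains, which is the gap the later sections address.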

The approaches considered in the literature have centered
on the correct application of perturbation (also called
sanitization), which encompasses a general class of database
modification techniques that ensure that a user interacts only
with a modified database that is derived from the original
(e.g., [4], [6]-[8]). Most of these perturbation approaches,
with the exception of differential privacy-based ones, are
heuristic and application-specific and often focus on additive
noise approaches.

Differential privacy: More recently, privacy approaches
for statistical databases have been driven by the differential
privacy definition [9], [13]-[15]. In these papers, the authors
take the view that privacy of an individual in a database
is related to the ability of an adversary to detect whether
that individual's data is in that database or not. Motivated
by cryptographic models, they formalize this intuition by
defining the difference in the adversary's outputs when
presented with two databases $D$ and $D'$ that are identical
except in one row.

Definition 1 ([11]): A function $\mathcal{K}$ gives $\epsilon$-differential
privacy if for all databases $D$, $D'$ defined as above, and all
$S \subseteq \mathrm{Range}(\mathcal{K})$,

$$\Pr[\mathcal{K}(D) \in S] \le \exp(\epsilon) \cdot \Pr[\mathcal{K}(D') \in S] \qquad (1)$$

where the probability space in each case is over the coin flips
of $\mathcal{K}$.

It is important to make two observations regarding the
above definition. First, the probabilities in Definition 1 are
over the actions of the function $\mathcal{K}$ and not over the
distribution of $D$; in other words, the definition is independent
of the distribution from which $D$ may be sampled. Second,
Definition 1 guarantees that the presence or absence of an
individual row in the database makes very little difference to
the output of the adversary as required, and thus, provides a
precise privacy guarantee to any individual in the database.

More recently, Dwork et al. [15] also provide a mechanism
for achieving $\epsilon$-differential privacy universally for statistical
queries (queries that map subsets of database entries to real
numbers) which we summarize below. Let $Z \sim \mathrm{Lap}(b)$
represent a Laplacian distributed random variable with
parameter $b$. If $b = 1/\epsilon$ we have that the density at $z$ is
proportional to $\exp(-\epsilon|z|)$ and for any $(z, z')$ such that
$|z - z'| \le 1$, $\Pr(z)$ and $\Pr(z')$ are within a factor of $e^{\epsilon}$. The
following proposition shows that it is possible to achieve
$\epsilon$-differential privacy for a given statistical query class for
a suitable choice of the Laplacian parameter.

Proposition 1 ([15]): For any statistical query $f : D \to \mathbb{R}$,
the mechanism $L$ that adds independently generated noise
to the output terms with distribution $\mathrm{Lap}(\Delta f / \epsilon)$ guarantees
$\epsilon$-differential privacy, where $\Delta f = \max |f(D) - f(D')|$ for
$D$, $D'$ which are different in exactly one row.

Proposition 1 is the most significant milestone in the
theory of privacy because it provides a method to guarantee
a strong but quantifiable notion of privacy for statistical
databases independent of their content. Furthermore, the
noise distribution can be chosen after seeing the query, so
that the noise level can be adjusted adaptively when presented
with a sequence of queries. However, one constraint
in using Proposition 1 for a given $\epsilon$ is that $\Delta f$ may be difficult
to estimate - a loose bound on $\Delta f$ may result in an overly
large noise scale $\Delta f / \epsilon$, thereby resulting in a possible
degradation of utility.
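The additive-noise mechanism of Proposition 1 can be illustrated in a few lines. The following is a minimal sketch using only the Python standard library; the helper names `laplace_noise` and `laplace_mechanism`, the counting query, and the toy `ages` data are hypothetical and chosen for illustration. The Laplacian sample is drawn as the difference of two independent exponential variables with the same mean, a standard construction and not a method taken from [15].

```python
import random

def laplace_noise(scale, rng):
    """Draw one Lap(scale) sample as the difference of two independent
    exponential random variables, each with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return the query answer perturbed with Lap(sensitivity / epsilon) noise."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical counting query: changing one row changes the count by at most 1,
# so the sensitivity (Delta f) is taken to be 1.
ages = [34, 51, 29, 62, 45, 38]
true_count = sum(1 for a in ages if a >= 40)

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

With these assumed values the noise scale is $\Delta f / \epsilon = 2$, so an overestimated sensitivity would widen the noise in direct proportion.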

To date, privacy has been the main focus of most work 
in this area. Indeed, Dwork [9] says explicitly that privacy 
is paramount in their work. However, databases exist to be 
useful and implementing sanitization techniques may hurt 
the usefulness of the database while safeguarding privacy. 
In much of the earlier work on database privacy, the utility 
is implicit. For example, Sweeney assumes that the databases
can be $k$-anonymized and still maintain usefulness. However,
without a relationship between k and some formal notion 
of usefulness, it is impossible to say what a reasonable 
value of k should be in reality. Similarly, utility in privacy- 
preserving techniques such as clustering [4] and histograms 
[16] is assumed to be guaranteed as a direct result of 
the methods used; for example, in [16] it is shown that 
approximation algorithms that can run on original histograms 
can also run on the sanitized histograms with a degradation 
of performance. Clustering, a common sanitization technique 
[4], [17], [18], is claimed to maintain utility as a result of the 
following property: all points in a cluster are mapped to the 
cluster center, so no point is moved more than the diameter 
of the largest cluster. 

The differential privacy model uses additive noise for
sanitization which in turn suggests a utility metric related
to the accuracy of the sanitized database. The Laplacian
noise model was chosen for achieving differential privacy
in part because its mean and mode are zero, so that the
added noise is most likely to be small. The privacy parameter $\epsilon$
is inversely related to the variance of the added noise - a
better privacy guarantee requires a smaller $\epsilon$ which in turn
implies higher variance. The accuracy of a sanitized database
as a whole is inversely related to the privacy requirement.
Determining the appropriate range of $\epsilon$ so that both privacy
and accuracy requirements are balanced requires knowledge
of the specific application. As an example, in the case of
learning, recent results [19] in the area of private learning
bound the extent to which the performance (i.e., accuracy) of
certain kinds of classifiers degrades when the training data is
sanitized using the $L$ mechanism in Proposition 1. In such
cases, it is possible to have both differential privacy with a
known $\epsilon$ and a quantified utility loss for the application
under consideration.
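The inverse relation between the privacy parameter and the noise variance can be made explicit. The short derivation below is a sketch that combines the standard variance of a Laplacian random variable (a textbook fact, not a result of this paper) with the noise scale $b = \Delta f / \epsilon$ used in Proposition 1:

```latex
Z \sim \mathrm{Lap}(b), \quad
p_Z(z) = \frac{1}{2b} \exp\!\left(-\frac{|z|}{b}\right)
\;\Longrightarrow\; \mathrm{Var}(Z) = 2 b^2 ,
\qquad
b = \frac{\Delta f}{\epsilon}
\;\Longrightarrow\; \mathrm{Var}(Z) = \frac{2 (\Delta f)^2}{\epsilon^2} .
```

Under this relation, halving the privacy parameter quadruples the variance of the added noise.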

C. Privacy vs. Secrecy 

It is important to contrast the privacy problem with
the well-studied (cryptographic and information-theoretic) 
secrecy problem where the task is to stop specific information 
from being received by untrusted third parties (eavesdrop- 
pers, wire-tappers, and other kinds of adversaries). In the 
private information retrieval model [20], the privacy problem 
is inverted in that the adversary is the database from whom 
the user wants to keep his query secret. In the secure 
multi-party computation model [21], each player wishes to
keep his entire input secret from the other players while
jointly computing a function on all the inputs. In all these 
problems, a specific data item is clearly either secret or 
public, whereas in the privacy problem, the same data while 
providing informational utility to the user can reveal private 
information about the individuals represented by the data. 
This eliminates the possibility of using secrecy techniques 
such as a specific model of the adversary or of harnessing 
any computing [22] or physical advantages such as secret 
keys, channel differences, or side information [23]. 

III. An Information-Theoretic Approach 

A. Model for Databases 

Circumventing the semantic issue: In general, utility and 
privacy metrics tend to be application specific. Focusing our 
efforts on developing an analytical model, we propose to cap- 
ture a canonical database model and representative abstract 
metrics. Such a model will circumvent the classic privacy 
issues related to the semantics of the data by assuming 
that there exist forward and reverse maps of the data set 
to the proposed abstract format (e.g., a string of bits or a
sequence of real values). Such mappings are often implicitly 
assumed in the privacy literature [4], [8], [9]; our motivation 
for making it explicit is to separate the semantic issues from 
the abstraction and apply Shannon-theoretic techniques. 

Model: Our proposed model focuses on large databases
with $K$ attributes per entry. Let $X_k \in \mathcal{X}_k$ be a random
variable denoting the $k$th attribute, $k = 1, 2, \ldots, K$, and
let $X = (X_1, X_2, \ldots, X_K)$. A database $d$ with $n$ rows is
a sequence of $n$ independent observations of $X$ from the
distribution

$$p_X(x) = p_{X_1 X_2 \ldots X_K}(x_1, x_2, \ldots, x_K) \qquad (2)$$

which is assumed to be known to both the designers and
users of the database. Our simplifying assumption of row
independence holds generally (but not always) as correlation
is typically across attributes and not across entries. We write
$X^n = (X_1^n, X_2^n, \ldots, X_K^n)$ to denote the $n$ independent
observations of $X$. This database model is universal in the
sense that most practical databases can be mapped to this
model.
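The database model above (rows drawn independently from a known joint distribution over the attributes) can be simulated directly. The following is a minimal sketch; the joint pmf over two binary attributes, its probabilities, and the helper names `sample_database` and `column` are hypothetical and serve only to illustrate correlation across attributes with independence across rows.

```python
import random

# Hypothetical joint pmf over K = 2 binary attributes (x1, x2).
# The two attributes are positively correlated; values are illustrative only.
joint_pmf = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def sample_database(pmf, n, rng):
    """Draw n rows independently from the joint pmf.
    Each row is a tuple holding the K attribute values of one entry."""
    outcomes = list(pmf.keys())
    weights = list(pmf.values())
    return rng.choices(outcomes, weights=weights, k=n)

def column(db, k):
    """Return the n observations of attribute k (0-based index)."""
    return [row[k] for row in db]

db = sample_database(joint_pmf, n=1000, rng=random.Random(0))
first_attribute = column(db, 0)
```

Because the rows are drawn independently, any dependence in the sampled table appears only between the columns of the same row, matching the row-independence assumption stated above.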

A joint distribution in (2) models the fact that the attributes
in general are correlated and can reveal information about
one another. In addition to the revealed information, a user
of a database can have access to correlated side information
from other information sources. We model the side
information as an $n$-length sequence $Z^n$ which is correlated
with the database entries via a joint distribution $p_{XZ}(x, z)$.

Public and private variables: We consider a general model
in which some attributes need to be kept private while the
source can reveal a function of some or all of the attributes.
We write $\mathcal{K}_r$ and $\mathcal{K}_h$ to denote sets of public (subscript $r$
for revealed) and private (subscript $h$ for hidden) attributes,
respectively, such that $\mathcal{K}_r \cup \mathcal{K}_h = \mathcal{K} = \{1, 2, \ldots, K\}$. We
further denote the corresponding collections of public and
private attributes by $X_r = \{X_k\}_{k \in \mathcal{K}_r}$ and $X_h = \{X_k\}_{k \in \mathcal{K}_h}$,
respectively. Our notation allows for an attribute to be both
public and private; this is to account for the fact that a
database may need to reveal a function of an attribute while
keeping the attribute itself private. In general, a database
can choose to keep public (or private) one or more attributes
($K \ge 1$). Irrespective of the number of private attributes,
a non-zero utility results only when the database reveals an
appropriate function of some or all of its attributes.

Special cases: For $K = 1$, the lone attribute of each entry
(row) is both public and private, and thus, we have
$X = X_r = X_h$. Such a model is appropriate for data mining
[8]; for a more general case in which $\mathcal{K}_h = \mathcal{K}_r = \mathcal{K}$, we
obtain a model for census [4], [6] data sets in which utility
generally is achieved by revealing a function of every entry
of the database while simultaneously ensuring that no entry
is perfectly revealed. For $K = 2$, $\mathcal{K}_h \cup \mathcal{K}_r = \mathcal{K}$, and
$\mathcal{K}_h \cap \mathcal{K}_r = \emptyset$, we obtain the Yamamoto model in [10].

B. Metrics: The Privacy and Utility Principle 

Even though utility and privacy measures tend to be spe- 
cific to the application, there is a fundamental principle that 
unifies all these measures in the abstract domain. The aim 
of a privacy-preserving database is to provide some measure 
of utility to the user while at the same time guaranteeing a 
measure of privacy for the entries in the database. 

A user perceives the utility of a perturbed database to be 
high as long as the response is similar to the response of 
the unperturbed database; thus, the utility is highest for an
unperturbed database and goes to zero when the perturbed 
database is completely unrelated to the original database. 
Accordingly, our utility metric is an appropriately chosen 
average 'distance' function between the original and the 
perturbed databases. Privacy, on the other hand, is maximized 
when the perturbed response is completely independent of 
the data. Our privacy metric measures the difficulty of 
extracting any private information from the response, i.e., 
the amount of uncertainty or equivocation about the private 
attributes given the response. 

C. Utility-Privacy Tradeoffs 

1) A Privacy-Utility Tradeoff Model: We now propose a 
privacy-utility model for databases. Our primary contribution 
is demonstrating the equivalence between the database pri- 
vacy problem and a source coding problem with additional 
privacy constraints. A primary motivation for our approach 
is the observation that database sanitization is traditionally 
the process of distorting the data to achieve some measure 
of privacy. For our abstract universal database model, san- 
itization is thus a problem of mapping a set of database 
entries to a different set subject to specific utility and privacy 
requirements. 

Our notation below relies on this abstraction. Recall that 
a database $d$ with $n$ rows is an instantiation of $X^n$. Thus,
we will henceforth refer to a real database d as an input 
sequence and to the corresponding sanitized database (SDB) 
d' as an output sequence. When the user has access to side 
information, the reconstructed sequence at the user will in 
general be different from the SDB sequence. 



Our coding scheme consists of an encoder $F_E$ which
is a mapping from the set of all input sequences (i.e., all
databases $d$ chosen from an underlying distribution) to a set
of indices $\mathcal{W} = \{1, 2, \ldots, M\}$ and an associated table of
output sequences (each of which is a $d'$) with a one-to-one
mapping to the set of indices given by

$$F_E : (\mathcal{X}_1^n \times \mathcal{X}_2^n \times \cdots \times \mathcal{X}_k^n)_{k \in \mathcal{K}_{enc}} \to \mathcal{W} \equiv \{SDB_k\}_{k=1}^{M} \qquad (3)$$

where $\mathcal{K}_r \subseteq \mathcal{K}_{enc} \subseteq \mathcal{K}$ and $M = 2^{nR}$ is the number of
output (sanitized) sequences created from the set of all input
sequences. The encoding rate $R$ is the number of bits per row
(without loss of generality, we assume $n$ rows in $d$ and $d'$) of
the sanitized database. The encoding $F_E$ in (3) includes both
public and private attributes in order to model the general
case in which the sanitization depends on a subset of all
attributes.

A user with a view of the SDB (i.e., an index $w \in \mathcal{W}$
for every $d$) and with access to side information $Z^n$, whose
entries $Z_i$, $i = 1, 2, \ldots, n$, take values in the alphabet $\mathcal{Z}$,
reconstructs the database via the mapping

$$F_D : \mathcal{W} \times \mathcal{Z}^n \to \{\hat{x}_{r,m}^n\}_{m=1}^{M} \qquad (4)$$

where $\hat{X}_r^n = F_D(F_E(X^n), Z^n)$.

A database may need to satisfy multiple utility constraints
for different (disjoint) subsets of attributes, and thus, we
consider a general framework with $L \ge 1$ utility functions
that need to be satisfied. Relying on the distance-based utility
principle, we model the $l$th utility, $l = 1, 2, \ldots, L$, via the
requirement that the average distortion $\Delta_l$ of the revealed
variables is upper bounded, for some $\epsilon > 0$, as

$$u_l : \Delta_l \equiv E\left[\frac{1}{n} \sum_{i=1}^{n} d(X_{r,i}, \hat{X}_{r,i})\right] \le D_l + \epsilon, \quad l = 1, 2, \ldots, L, \qquad (5)$$

where $d(\cdot, \cdot)$ denotes a distortion function, $E$ is the
expectation over the joint distribution of $(X_r, \hat{X}_r)$, and the
subscript $i$ in $X_{r,i}$ and $\hat{X}_{r,i}$ denotes the $i$th entry of $X_r^n$
and $\hat{X}_r^n$, respectively. Examples of distance-based distortion
functions include the Euclidean distance for Gaussian
distributed database entries, the Hamming distance for binary
input and output sequences, and the Kullback-Leibler (K-L)
'distance' comparing the input and output distributions.
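The distortion measures named above are simple to compute for a given pair of sequences. The following is a minimal sketch with hypothetical helper names; it replaces the expectation in the utility constraint with an empirical per-entry average over a single realization, uses squared error as the Euclidean-type measure, and assumes for the K-L divergence that the second pmf is positive wherever the first is.

```python
import math

def hamming_distortion(x, x_hat):
    """Average per-entry Hamming distortion between two equal-length sequences."""
    if len(x) != len(x_hat):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, x_hat)) / len(x)

def squared_error_distortion(x, x_hat):
    """Average per-entry squared-error (Euclidean-type) distortion."""
    if len(x) != len(x_hat):
        raise ValueError("sequences must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def kl_divergence(p, q):
    """K-L divergence D(p || q) in bits between two pmfs given as lists.
    Assumes q[i] > 0 whenever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy original and sanitized binary columns (illustrative values only).
original = [0, 1, 1, 0]
sanitized = [0, 1, 0, 0]
delta = hamming_distortion(original, sanitized)
```

A sanitized column would meet a Hamming-type utility constraint with level $D_l$ in this empirical sense when `delta` does not exceed $D_l$.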

Having argued that a quantifiable uncertainty captures the 
underlying privacy principle of a database, we model the 
uncertainty or equivocation about the private variables using 
the entropy function as

$$p : \Delta_p \equiv \frac{1}{n} H(X_h^n | W, Z^n) \ge E - \epsilon, \qquad (6)$$

i.e., we require the average number of uncertain bits per
entry to be lower bounded by $E$. The case in which side
information is not available at the user is obtained by simply
setting $Z^n = 0$ in (4) and (6).
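The equivocation in the privacy constraint is a conditional entropy and can be evaluated from a joint distribution. The following is a minimal single-letter sketch, assuming no side information and a hypothetical response that equals a uniform hidden bit flipped with probability 0.1; the function name `conditional_entropy` and the toy pmf are illustrative and not taken from the paper.

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """H(A | B) in bits for a joint pmf given as {(a, b): probability}."""
    marginal_b = defaultdict(float)
    for (_a, b), p in joint.items():
        marginal_b[b] += p
    return -sum(
        p * math.log2(p / marginal_b[b])
        for (_a, b), p in joint.items()
        if p > 0
    )

# Hypothetical example: the hidden bit is uniform and the response equals
# the hidden bit flipped with probability 0.1. Keys are (hidden, response).
joint_flip = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
equivocation = conditional_entropy(joint_flip)
```

For this toy channel the equivocation equals the binary entropy of the flip probability; a response independent of the hidden bit would leave the full one bit of uncertainty, and a response that copies the hidden bit would leave none.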

The utility and privacy metrics in (5) and (6), respectively,
capture two aspects of our universal model: a) both represent
averages by computing the metrics across all database
instantiations $d$, and b) the metrics bound the average distortion and
privacy per entry. Thus, as the likelihood of the non-typical
sequences decreases exponentially with increasing $n$ (very
large databases), these guarantees apply nearly uniformly to
all (typical) entries. Our general model also encompasses
the fact that the exact mapping from the distortion and
equivocation domains to the utility and privacy domains,
respectively, can depend on the application domain. We write
$D = (D_1, D_2, \ldots, D_L)$ and $\Delta = (\Delta_1, \Delta_2, \ldots, \Delta_L)$. Based
on our notation thus far, we define the utility-privacy tradeoff
region as follows.

Definition 2: The utility-privacy tradeoff region $\mathcal{T}$ is the
set of all feasible utility-privacy tuples $(D, E)$ for which
there exists a coding scheme $(F_E, F_D)$ given by (3) and (4),
respectively, with parameters $(n, M, \Delta, \Delta_p)$ satisfying the
constraints in (5) and (6).

2) Equivalence of Utility-Privacy and Rate-Distortion-Equivocation:
We now present an argument for the equivalence
of the above utility-privacy tradeoff analysis with a
rate-distortion-equivocation analysis of the same source. For 
the database source model described here, a classic lossy 
source coding problem is defined as follows. 

Definition 3: The set of tuples $(R, D)$ is said to be feasible
(achievable) if there exists a coding scheme given by (3)
and (4) with parameters $(n, M, \Delta)$ satisfying the constraints
in (5) and a rate constraint

$$M \le 2^{n(R + \epsilon)}. \qquad (7)$$

When an additional privacy constraint in (6) is included,
the source coding problem becomes one of determining
the achievable rate-distortion-equivocation region defined as
follows.

Definition 4: The rate-distortion-equivocation region $\mathcal{R}$ is
the set of all tuples $(R, D, E)$ for which there exists a coding
scheme given by (3) and (4) with parameters $(n, M, \Delta, \Delta_p)$
satisfying the constraints in (5), (6), and (7). The set of
all feasible distortion-equivocation tuples $(D, E)$ is denoted
by $\mathcal{R}_{D-E}$, the equivocation-distortion function in the $D$-$E$
plane is denoted by $\Gamma(D)$, and the distortion-equivocation
function which quantifies the rate as a function of both $D$
and $E$ is denoted by $R(D, E)$.

Thus, a rate-distortion-equivocation code is by definition a 
(lossy) source code satisfying a set of distortion constraints 
that achieves a specific privacy level for every choice of the 
distortion tuple. In the following theorem, we present a basic 
result capturing the precise relationship between $\mathcal{T}$ and $\mathcal{R}$.
To the best of our knowledge, this is the first analytical result 
that quantifies a tight relationship between utility and privacy. 
We briefly sketch the proof here; details can be found in [24]. 

Theorem 1: For a database with a set of utility and privacy
metrics, the tightest utility-privacy tradeoff region $\mathcal{T}$ is the
distortion-equivocation region $\mathcal{R}_{D-E}$.

Proof: The crux of our argument is the fact that for
any feasible utility level $D$, choosing the minimum rate
$R(D, E)$ ensures that the least amount of information is
revealed about the source via the reconstructed variables.
This in turn ensures that the maximum privacy of the private
attributes is achieved for that utility since, in general, the
public and private variables are correlated. For the same
set of utility constraints, since such a rate requirement is
not a part of the utility-privacy model, the resulting privacy
achieved is at most as large as that in $\mathcal{R}_{D-E}$ (see Fig. 1(a)).

■ 

Implicit in the above argument is the fact that a utility- 
privacy achieving code does not perform any better than 
a rate-distortion-equivocation code in terms of achieving a 
lower rate (given by $(\log_2 M)/n$) for the same distortion and
privacy constraints. We can show this by arguing that if such 
a code exists then we can always find an equivalent source 
coding problem for which the code would violate Shannon's 
source coding theorem [25]. An immediate consequence of 
this is that a distortion-constrained source code suffices to 
preserve a desired level of privacy; in other words, the utility 
constraints require revealing data which in turn comes at a 
certain privacy cost that must be borne and vice-versa. We 
capture this observation in Fig. 1(b) where we contrast existing
privacy-exclusive and utility-exclusive regimes (extreme
points of the utility-privacy tradeoff curve) with our more 
general approach of determining the set of feasible utility- 
privacy tradeoff points. 

From an information-theoretic perspective, the power of 
Theorem 1 is that it allows us to study the larger problem
of database utility-privacy tradeoffs in terms of a relatively 
familiar problem of source coding with privacy constraints. 
As noted previously, this problem has been studied for a 
specific source model by Yamamoto and here we expand his 
elegant analysis to arbitrary database models including those 
with side information at the user. Rate for the database can 
be interpreted as the number of revealed information bits 
(precision) per row. Our result shows the tight relationship 
between utility, privacy, and precision - fixing the value of 
any one determines the other two; for example, fixing the 
utility (distortion $D$) precisely quantifies the maximal privacy
$\Gamma(D)$ and the minimal precision $R(D, E)$ for any $E$ bounded
by $\Gamma(D)$.

3) Capturing the Effects of Side-Information: It has been 
illustrated that when a user has access to an external data 
source (which is not part of the database under consideration) 
the level of privacy that can be guaranteed changes [7], [9]. 
We cast this problem in information-theoretic terms as a side 
information problem. 

In an extended version [24] of this work, we develop the 
tightest utility-privacy tradeoff region for the three cases of 
a) no side information (L = 1 case studied in [10]), b) 
side information only at the user, and c) side information 
at both the source (database) and the user. We present a 
result for the case with side information at the user only 
and for simplicity, we assume a single utility function, i.e., 
L = 1. The proof uses an auxiliary random variable U along 
the lines of source coding with side information [26] and 
bounds the equivocation just as in [10, Appendix 1]. The 
following theorem defines the bounds on the region $\mathcal{R}$ in
Definition 4 via the functions $\Gamma(D)$ and $R(D, E)$ where
$\Gamma(D)$ bounds the maximal achievable privacy and $R(D, E)$
is the minimal information rate (see Fig. 1(a)) for very large
databases ($n \to \infty$). The proof follows along the lines of
Yamamoto's proof in [10, Appendix 1] and is skipped in the
interest of space.

Fig. 1. (a) Rate Distortion Equivocation Region; (b) Utility-Privacy Tradeoff Region.

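Theorem 2: For a database with side information available
only at the user, the functions $\Gamma(D)$ and $R(D, E)$ and the
regions $\mathcal{R}_{D-E}$ and $\mathcal{R}$ are given by

$$\Gamma(D) = \sup_{p(x_r, x_h) p(u | x_r, x_h) \in \mathcal{P}(D)} H(X_h | U Z) \qquad (8)$$

$$R(D, E) = \inf_{p(x_r, x_h) p(u | x_r, x_h) \in \mathcal{P}(D, E)} I(X_h X_r; U) - I(Z; U) \qquad (9)$$

$$\mathcal{R}_{D-E} = \{(D, E) : D \ge 0, \; 0 \le E \le \Gamma(D)\} \qquad (10)$$

$$\mathcal{R} = \{(R, D, E) : D \ge 0, \; 0 \le E \le \Gamma(D), \; R \ge R(D, E)\} \qquad (11)$$

where $\mathcal{P}(D, E)$ is the set of all $p(x_r, x_h, z) p(u | x_r, x_h)$ such
that $E[d(X_r, g(U, Z))] \le D$ and $H(X_h | U Z) \ge E$, while
$\mathcal{P}(D)$ is defined as

$$\mathcal{P}(D) = \bigcup_{0 \le E \le H(X_h | Z)} \mathcal{P}(D, E). \qquad (12)$$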
While Theorem 2 applies to a variety of database models,
it is extremely useful in quantifying the utility-privacy
tradeoff for the following special cases of interest.

i) The single database problem (i.e., no side information):
SDB is revealed. Here, we have $Z = 0$ and $U = \hat{X}_r$, i.e.,
the reconstructed vectors seen by the user are the same as
the SDB vectors.

ii) Completely hidden private variables: Privacy is completely
a function of the statistical relationship between
public, private, and side information data. The expression for
$R(D, E)$ in (9) assumes the most general model of encoding
both the private and the public variables. When the private
variables can only be deduced from the revealed variables,
i.e., $X_h - X_r - U$ is a Markov chain, the expression for
$R(D, E)$ in (9) will simplify to the Wyner-Ziv source coding
formulation [26], thus clearly demonstrating that the privacy
of the hidden variables is a function of both the correlation
between the hidden and revealed variables and the distortion
constraint.

iii) Census and data mining problems without side
information: Information rate completely determines the degree
of privacy achievable. For $Z = 0$, setting $X_r = X_h = X$
(such that $U = \hat{X}$), we obtain the census/data mining
problem discussed earlier. In general, due to an additional
equivocation constraint, $R(D,E) \geq R(D)$; however, for this
case in which all the attributes in the database are public,
since $\Gamma(D) = H(X) - R(D,E) \leq H(X) - R(D)$, and $R(D)$
is achievable using a rate-distortion code, the largest possible
equivocation is also achievable. Our analysis thus formalizes
the intuition in [8] for using the mutual information as an
estimate of the privacy lost. However, in contrast to [8], in
which the underlying perturbation model is an additive noise
model, we assume a perturbation model most appropriate for
the input statistics, i.e., the stochastic relationship between
the output and input variables is chosen to minimize the rate
of information transfer.
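
The reductions in cases ii) and iii) rest on two standard identities. The short derivation below is an added sketch rather than part of the original proof. It uses only the chain rule for mutual information and the quantities defined in Theorem 2, and for case iii) it assumes that the revealed and hidden variables both coincide with the full entry $X$ and that the user's reconstruction is $\hat{X}$.

```latex
% Case ii): if X_h - X_r - U is a Markov chain, then I(X_h; U | X_r) = 0, so
\begin{align*}
I(X_h X_r; U) - I(Z; U)
  &= I(X_r; U) + I(X_h; U \mid X_r) - I(Z; U) \\
  &= I(X_r; U) - I(Z; U),
\end{align*}
% which is the Wyner-Ziv rate expression for X_r with side information Z
% available only at the decoder.

% Case iii): with no side information and U = \hat{X},
\begin{align*}
H(X \mid \hat{X}) &= H(X) - I(X; \hat{X}),
\end{align*}
% so for a fixed source entropy H(X) the equivocation is largest exactly
% when the information rate I(X; \hat{X}) is smallest.
```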

IV. Illustration of Results 

We illustrate our results for two types of databases: one, 
a categorical database and the other a numerical database. 
Categorical data are typically discrete data sets comprising 
information such as gender, social security numbers, and 
zip codes that provide (meaningful) utility only if they are 
mapped within their own set. On the other hand, without 
loss of generality numeric data can be assumed to belong 
to the set of real numbers. In general, a database will have 
a mixture of categorical and numerical attributes but for the 
purpose of illustration, we assume that the database is of one 
type or the other, i.e., every attribute is of the same kind. In 
both cases, we assume a single utility (distortion) function. 
We discuss each example in detail below. 



Example 1: Consider a categorical database with $K \geq 1$
attributes. In general, the $k$th attribute $X_k$ takes values
in a discrete set $\mathcal{X}_k$ of finite cardinality. For our example,
we model the utility as a single distortion function of all
attributes, and therefore, it suffices to view each entry (a
row of all $K$ attributes) of the database as generated from
a single source $X$ of cardinality $M$, i.e., $X \sim p(x)$,
$x \in \{1,2,\ldots,M\}$. For this arbitrary discrete source model, we
assume that the output sample space $\hat{\mathcal{X}} = \mathcal{X}$ and consider
the generalized Hamming distortion as the utility function
such that the average distortion $D$ is given by



$$D = E[d(X,\hat{X})] = \Pr\{X \neq \hat{X}\}. \tag{13}$$

For $K = 1$, one can show that $R(D,E) = R(D)$ [24]; this
is because the maximum achievable equivocation is bounded
as $\Gamma(D) = H(X) - R(D,E) \leq H(X) - R(D)$, with equality
when $R(D)$ is achievable. It has been shown by Erokhin [27]
and Pinkston [28] that $R(D)$ is achieved by upside down
waterfilling such that

$$p(\hat{x}) = \frac{(p(x) - \lambda)^{+}}{\sum_{x \in \mathcal{X}} (p(x) - \lambda)^{+}} \tag{14}$$

and the 'test channel' is given by

$$p(x|\hat{x}) = \begin{cases} \bar{D}, & x = \hat{x} \\ \lambda, & x \neq \hat{x},\ x \in \mathcal{X}_{\mathrm{supp}} \\ p_k, & x = k \notin \mathcal{X}_{\mathrm{supp}} \end{cases} \tag{15}$$

where $\bar{D} = 1 - D$, $\lambda$ is chosen such that
$\sum_{\hat{x}} p(\hat{x})\,p(x|\hat{x}) = p(x)$, $p_k = p(x = k)$, and
$\mathcal{X}_{\mathrm{supp}} = \{x : p(x) - \lambda > 0\}$. The
maximum achievable equivocation, and hence, the largest
utility-privacy tradeoff region is

$$\Gamma(D) = -\bar{D}\log\bar{D} - \sum_{x \neq \hat{x},\, x \in \mathcal{X}_{\mathrm{supp}}} \lambda \log \lambda - \sum_{k \notin \mathcal{X}_{\mathrm{supp}}} p_k \log p_k. \tag{16}$$

Remark 1: The distortion function chosen in (13) captures
the fact that for categorical data the utility (fidelity) of the
revealed data is reduced if any entry is changed from its
original value. The optimal upside down waterfilling solution
in (14) has the effect of 'flattening' the output distribution,
and thus, as in (15), the source samples with very high or
very low probabilities (relative to the waterfilling level) are
ignored (thereby minimizing the information transfer rate).
This in turn maximizes the privacy achieved since the outliers
that are easiest to infer are eliminated. Eliminating outliers,
referred to as information suppression or aggregation, is
the privacy-preserving technique of choice for the statistics
community.

Example 2: In this example we model a numerical
database. We consider a $K = 2$ database where both
attributes $X$ and $Y$ are jointly Gaussian with zero means
and variances $\sigma_X^2$ and $\sigma_Y^2$, respectively, and with correlation
coefficient $\rho = E[XY]/(\sigma_X \sigma_Y)$. This model applies for
numeric data such as height and weight measures which are
generally assumed to be normally distributed. We assume
that for every entry only one of the two attributes, say
$X$, is revealed while the other, say $Y$, is hidden such that
$Y - X - \hat{X}$ forms a Markov chain. The rate-distortion-equivocation
region for this case can be obtained directly
from Yamamoto's results [10] with appropriate substitution
for a jointly Gaussian source. Furthermore, due to the
Markov relationship between $X$, $Y$, and $\hat{X}$, the minimization
of $I(X;\hat{X})$ is strictly over $p(\hat{x}|x)$, and thus simplifies
to the familiar rate-distortion problem for a Gaussian source
$X$, which in turn is achieved by choosing the reverse channel
from $\hat{X}$ to $X$ as an additive white Gaussian noise channel
with variance $D$ (average distortion). The maximal
equivocation achieved thus is

$$\Gamma(D) = \sigma_Y^2\left[(1-\rho^2) + \rho^2 D/\sigma_X^2\right], \quad D \leq \sigma_X^2. \tag{17}$$


Therefore, $\Gamma(D)$ is a minimum for $D = 0$ ($X$ revealed
perfectly), in which case only the data independent of $X$
in $Y$ can be private, and is a maximum equal to the entropy
of $Y$ at the maximum distortion $D = \sigma_X^2$. Thus, the largest
utility-privacy tradeoff region is simply the region enclosed
by $\Gamma(D)$.
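
Both examples can be evaluated numerically. The Python sketch below is an illustration added here, not code from the paper. For Example 1, `hamming_equivocation` finds the water level $\lambda$ of (14)-(15) by bisection and then evaluates (16) in bits. It relies on the identity $D = \sum_x \min(p(x), \lambda) - \lambda$, which is not stated in the text but follows from requiring each row of the test channel (15) to sum to one. For Example 2, `gaussian_conditional_variance` evaluates the expression $\sigma_Y^2[(1-\rho^2) + \rho^2 D/\sigma_X^2]$ appearing in (17), and `gaussian_equivocation_bits` converts that variance to a differential entropy with the standard Gaussian formula $\frac{1}{2}\log_2(2\pi e v)$. All function names, the bisection tolerance, and the use of base-2 logarithms are assumptions made for illustration.

```python
import math


def xlog2x(t):
    """Return t * log2(t), using the convention 0 * log 0 = 0."""
    return t * math.log2(t) if t > 0.0 else 0.0


def hamming_distortion_at(p, lam):
    """Distortion implied by water level lam under the test channel (15).

    Row normalization of (15) gives D = (|S| - 1) * lam + sum of p_k outside
    the support S, which equals sum_x min(p(x), lam) - lam.
    """
    return sum(min(px, lam) for px in p) - lam


def hamming_equivocation(p, D, tol=1e-12):
    """Equivocation (16), in bits, for source distribution p at distortion D."""
    p = list(p)
    if len(p) < 2 or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a distribution over at least two symbols")
    ranked = sorted(p, reverse=True)
    d_max = 1.0 - ranked[0]  # distortion of always guessing the likeliest symbol
    if not 0.0 <= D <= d_max + 1e-12:
        raise ValueError("D must lie in [0, 1 - max(p)]")

    # The implied distortion is nondecreasing in lam on [0, second-largest p].
    lo, hi = 0.0, ranked[1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if hamming_distortion_at(p, mid) < D:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)

    # Sum of q*log2(q) over the off-diagonal entries of one row of (15).
    off_diagonal = sum(xlog2x(min(px, lam)) for px in p) - xlog2x(lam)
    return -xlog2x(1.0 - D) - off_diagonal


def gaussian_conditional_variance(D, var_x, var_y, rho):
    """Bracketed expression of (17): sigma_Y^2 [(1 - rho^2) + rho^2 D / sigma_X^2]."""
    if not 0.0 <= D <= var_x:
        raise ValueError("D must lie in [0, var_x]")
    return var_y * ((1.0 - rho ** 2) + rho ** 2 * D / var_x)


def gaussian_equivocation_bits(D, var_x, var_y, rho):
    """Differential entropy, in bits, of a Gaussian with the variance above.

    Raises ValueError when the variance is zero (D = 0 with |rho| = 1).
    """
    v = gaussian_conditional_variance(D, var_x, var_y, rho)
    return 0.5 * math.log2(2.0 * math.pi * math.e * v)


if __name__ == "__main__":
    source = [0.5, 0.3, 0.2]  # hypothetical categorical source
    for d in (0.0, 0.2, 0.45, 0.5):
        print(d, hamming_equivocation(source, d))
    for d in (0.0, 0.5, 1.0):
        print(d, gaussian_conditional_variance(d, var_x=1.0, var_y=2.0, rho=0.5))
```

For the categorical source, the returned equivocation rises from approximately zero at $D = 0$ to the source entropy $H(X)$ at $D = 1 - \max_x p(x)$, which traces the boundary of the utility-privacy tradeoff region of Example 1. For the Gaussian pair, the variance rises linearly in $D$ from $\sigma_Y^2(1-\rho^2)$ to $\sigma_Y^2$.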

V. Concluding Remarks 

We have presented an abstract model for databases with an 
arbitrary number of public and private variables, developed 
application-independent privacy and utility metrics, and used 
rate distortion theory to determine the fundamental
utility-privacy tradeoff limits. Future work includes eliminating the
row independence (i.i.d.) assumption, modeling and studying
tradeoffs for multiple query databases, and relating current
approaches in computer science to our universal approach.
An equally pertinent question is to understand whether our 
formalism can be extended to study privacy-utility tradeoffs 
for less structured datasets as well as social networks. 